Add identification and structuring of special text content #64
Open
eduranm wants to merge 15 commits into scieloorg:main
Pull request overview
This PR adds support in markup_doc for identifying and structuring special content within the document body (figures/images, tables, lists, and formulas) extracted from DOCX files, integrating it with Wagtail (StreamFields) and the processing pipeline (Celery tasks).
Changes:
- Adds a DOCX extractor (markuplib/function_docx.py) that detects images, tables, lists, and formulas and serializes them as intermediate objects.
- Adds a Django/Wagtail app, markup_doc (models, admin hooks, tasks, utilities), to persist and operate on the structured markup.
- Extends the API router and settings to expose internal endpoints and register the required apps.
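
The code quoted in the review comments suggests the shape of these intermediate objects: each is a dict with a `type` key and a payload stored under a key of the same name. A minimal sketch of that shape, assuming nothing beyond the keys visible in this PR. The `TypedDict` names and the `make_special` helper are hypothetical, and the list shape is omitted because the quoted code does not show it.

```python
from typing import Any, TypedDict, Union


class ImageObject(TypedDict):
    type: str   # always 'image'
    image: int  # primary key of the stored Wagtail image


class FormulaObject(TypedDict):
    type: str     # always 'formula'
    formula: str  # MathML serialized as a unicode string


class TableObject(TypedDict):
    type: str   # always 'table'
    table: Any  # whatever the table extractor returns


SpecialObject = Union[ImageObject, FormulaObject, TableObject]

ALLOWED_TYPES = {"image", "formula", "table"}


def make_special(kind: str, payload: Any) -> dict:
    """Build {'type': kind, kind: payload}, rejecting unknown kinds."""
    if kind not in ALLOWED_TYPES:
        raise ValueError(f"unsupported special content type: {kind!r}")
    return {"type": kind, kind: payload}
```

Building every object through one helper keeps the `type` value and the payload key in sync, which is what downstream code would rely on when dispatching by type.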
Reviewed changes
Copilot reviewed 19 out of 28 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| model_ai/llama.py | Adjustment to generation with Gemini (includes a fixed pause). |
| markuplib/function_docx.py | Extraction of special content from DOCX (images/tables/lists/formulas). |
| markuplib/__init__.py | Package initialization. |
| markup_doc/wagtail_hooks.py | Registers ViewSets in the Wagtail admin and triggers tasks/syncs. |
| markup_doc/tests.py | Tests placeholder. |
| markup_doc/tasks.py | Main task get_labels, which transforms DOCX content into StreamFields, including “special content”. |
| markup_doc/sync_api.py | Synchronization of collections/journals from an external API. |
| markup_doc/models.py | Models and StreamFields for front/body/back and blocks (image/table/compound). |
| markup_doc/migrations/0001_initial.py | Initial migration for the markup_doc app. |
| markup_doc/migrations/0002_alter_articledocx_estatus_and_more.py | Adjusts choices/default for estatus. |
| markup_doc/marker.py | Wrapper for LLM-based markup (first block / metadata). |
| markup_doc/labeling_utils.py | Utilities for labeling, references, and building objects for special content. |
| markup_doc/forms.py | Base admin form. |
| markup_doc/choices.py | Label choices and base ordering/style rules. |
| markup_doc/apps.py | AppConfig for markup_doc. |
| markup_doc/api/v1/views.py | Authenticated endpoint for “first_block” (text markup by metadata). |
| markup_doc/api/v1/serializers.py | Base serializer for ArticleDocx. |
| markup_doc/api/v1/__init__.py | Module initialization. |
| markup_doc/api/__init__.py | Module initialization. |
| markup_doc/admin.py | Admin placeholder. |
| markup_doc/__init__.py | Package initialization. |
| markup_doc/migrations/__init__.py | Migrations initialization. |
| fixtures/e14790.docx | Example DOCX fixture. |
| config/settings/base.py | Registers markup_doc and markuplib in INSTALLED_APPS. |
| config/api_router.py | Registers the first_block endpoint in the DRF router. |
Comment on lines +367 to +375

    # Determine whether the paragraph is part of a list
    is_numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap) is not None

    # Get the id and level
    if is_numPr:
        numPr = paragraph.find('.//w:numPr', namespaces=paragraph.nsmap)
        numId = numPr.find('.//w:numId', namespaces=paragraph.nsmap).get(namespaces_p + 'val')
        type = [(key, objt) for key, objt in list_types.items() if objt['numId'] == numId]

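
The quoted lines look up `w:numPr` twice, call `.get(...)` on the `w:numId` result without checking it, and bind the outcome to `type`, which shadows the Python builtin. A minimal sketch of a single-pass alternative, assuming the standard WordprocessingML namespace and using the standard library's `xml.etree.ElementTree` in place of the lxml elements the PR works with. The `list_info` helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
W_VAL = f"{{{W_NS}}}val"  # fully qualified name of the w:val attribute


def list_info(paragraph: ET.Element) -> Optional[Tuple[str, int]]:
    """Return (numId, level) for a list paragraph, or None otherwise."""
    num_pr = paragraph.find(".//w:numPr", NS)
    if num_pr is None:
        return None  # not part of a list
    num_id_el = num_pr.find("w:numId", NS)
    num_id = num_id_el.get(W_VAL) if num_id_el is not None else None
    if num_id is None:
        return None  # malformed numPr: treat as a plain paragraph
    ilvl_el = num_pr.find("w:ilvl", NS)
    level = int(ilvl_el.get(W_VAL, "0")) if ilvl_el is not None else 0
    return num_id, level
```

Returning `None` for both "not a list" and "malformed numbering" is a simplification; the real extractor may want to distinguish the two.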
Comment on lines +359 to +363

    obj['type'] = 'formula'
    obj['formula'] = etree.tostring(mathml_root, pretty_print=True, encoding='unicode')


    if not obj_image:
Comment on lines +325 to +347

    for drawing in element.findall('.//w:drawing', namespaces=namespaces):
        if drawing.find('.//a:blip', namespaces=namespaces) is not None:
            blip = drawing.find('.//a:blip', namespaces=namespaces)
            if blip is not None:
                obj_image = True

                rId = blip.get('{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
                image_part = doc.part.related_parts[rId]
                image_data = image_part.blob
                image_name = image_part.partname.split('/')[-1]

                if image_name not in images:
                    images.append(image_name)

                # Save the image in Wagtail
                wagtail_image = ImageModel.objects.create(
                    title=image_name,
                    file=ContentFile(image_data, name=image_name)
                )

                # Reference the saved image in the object
                obj['type'] = 'image'
                obj['image'] = wagtail_image.id
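
In the quoted lines, `images` tracks the file names already seen, but the excerpt does not make clear whether that check also guards the `ImageModel.objects.create(...)` call. A minimal sketch, assuming the intent is to store each embedded image only once and that file names are unique within a document. `ImageRegistry` and the `create` callback are hypothetical stand-ins for the Wagtail model call.

```python
from typing import Callable, Dict


class ImageRegistry:
    """Remember the id created for each image name so it is stored once."""

    def __init__(self, create: Callable[[str, bytes], int]) -> None:
        # `create` would wrap the model's create() call and return the new id.
        self._create = create
        self._ids: Dict[str, int] = {}

    def get_or_create(self, name: str, data: bytes) -> int:
        """Return the cached id for `name`, creating the image on first use."""
        if name not in self._ids:
            self._ids[name] = self._create(name, data)
        return self._ids[name]
```

With this shape, a drawing that reuses an already stored image would still get a valid id for `obj['image']` without creating a second database row.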
Comment on lines +571 to +585

    elif isinstance(element, CT_Tbl):
        namespaces = {
            'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main',
            'a': 'http://schemas.openxmlformats.org/drawingml/2006/main',
            'r': 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'
        }

        table = element
        table_data = extrae_Tabla(element, hiperlinks_info, namespaces)
        obj = {}
        obj['type'] = 'table'
        obj['table'] = table_data

    if not is_numPr:
        content.append(obj)
Comment on lines +428 to +436

    def update(cls, title, estatus):
        try:
            obj = cls.get(title=title)
        except (cls.DoesNotExist, ValueError):
            pass

        obj.estatus = estatus
        obj.save()
        return obj
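
In the quoted method, if `cls.get` raises, the `except` branch only passes, so `obj` is never bound and the next line fails with `UnboundLocalError`. A minimal sketch of one way to handle it, assuming that returning `None` for a missing record is acceptable to callers. `FakeArticle` is a hypothetical in-memory stand-in for the Django model so the example runs on its own.

```python
from typing import Dict, Optional


class FakeArticle:
    """Hypothetical in-memory stand-in for the Django model in the PR."""

    class DoesNotExist(Exception):
        pass

    _store: Dict[str, "FakeArticle"] = {}

    def __init__(self, title: str, estatus: int = 0) -> None:
        self.title = title
        self.estatus = estatus

    def save(self) -> None:
        type(self)._store[self.title] = self

    @classmethod
    def get(cls, title: str) -> "FakeArticle":
        try:
            return cls._store[title]
        except KeyError:
            raise cls.DoesNotExist(title) from None

    @classmethod
    def update(cls, title: str, estatus: int) -> Optional["FakeArticle"]:
        try:
            obj = cls.get(title=title)
        except (cls.DoesNotExist, ValueError):
            return None  # nothing to update; the caller decides what to do
        obj.estatus = estatus
        obj.save()
        return obj
```

Re-raising the exception, or logging and returning `None`, are both reasonable; the point is that the code after the `try` block must not run when the lookup failed.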
Comment on lines +1126 to +1132

    id = search_special_id(data_body, label)

    res.append({
        "label": label,
        "id": id,
        "reftype": dict_type.get(id[0].lower(), 'other')
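
`id[0]` in the quoted lines would raise if `search_special_id` ever returns `None` or an empty string, and the name `id` shadows the Python builtin. A minimal sketch of a guarded lookup. The mapping below is an illustrative assumption, since the contents of `dict_type` are not shown in the excerpt, and `reftype_for` is a hypothetical helper name.

```python
from typing import Mapping, Optional

# Illustrative only: the real dict_type from the PR is not shown.
EXAMPLE_DICT_TYPE = {"f": "fig", "t": "table"}


def reftype_for(special_id: Optional[str],
                dict_type: Mapping[str, str] = EXAMPLE_DICT_TYPE) -> str:
    """Map the first character of an id to a reftype, defaulting to 'other'."""
    if not special_id:
        return "other"  # covers both None and the empty string
    return dict_type.get(special_id[0].lower(), "other")
```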
Comment on lines +81 to +89

    def update(cls, title, estatus):
        try:
            obj = cls.get(title=title)
        except (cls.DoesNotExist, ValueError):
            pass

        obj.estatus = estatus
        obj.save()
        return obj
Comment on lines +21 to +28

    def get_llm_model_name():
        # FIXME: This function always fetches the first LlamaModel instance.
        model_ai = LlamaModel.objects.first()

        if model_ai.api_key_gemini:
            return MODEL_NAME_GEMINI
        else:
            return MODEL_NAME_LLAMA
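
Django's `QuerySet.first()` returns `None` when there are no rows, in which case `model_ai.api_key_gemini` raises `AttributeError`. A minimal sketch of a guarded version, written as a pure function so it runs without Django. The constant values and the choice to fall back to the Llama model name are assumptions, and `pick_llm_model_name` is a hypothetical name.

```python
from types import SimpleNamespace
from typing import Any, Optional

# Placeholder values: the real constants are defined elsewhere in the project.
MODEL_NAME_GEMINI = "gemini"
MODEL_NAME_LLAMA = "llama"


def pick_llm_model_name(model_ai: Optional[Any]) -> str:
    """Choose a model name from a configuration row that may be missing."""
    if model_ai is not None and getattr(model_ai, "api_key_gemini", None):
        return MODEL_NAME_GEMINI
    return MODEL_NAME_LLAMA


# Stand-in for a LlamaModel row; in the PR this would be LlamaModel.objects.first().
example_row = SimpleNamespace(api_key_gemini="secret")
chosen = pick_llm_model_name(example_row)
```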
Comment on lines +794 to +798

    if model.name_file:
        user = User.objects.get(pk=user_id)
        refresh = RefreshToken.for_user(user)
        access_token = refresh.access_token

|
Comment on lines
+701
to
+705
| if not result: | ||
| result = {'label': '<p>', 'body': state['body'], 'back': state['back']} | ||
| state['label'] = result.get('label') | ||
| state['body'] = result.get('body') | ||
| state['back'] = result.get('back') |
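
In the quoted lines, a non-empty `result` that lacks one of the keys makes `result.get(...)` return `None`, which then overwrites the corresponding entry in `state`. A minimal sketch that falls back per key, assuming that keeping the previous `body`/`back` values and defaulting the label to `'<p>'` is the desired behaviour. `merge_result` is a hypothetical helper name.

```python
from typing import Any, Dict, Optional


def merge_result(state: Dict[str, Any],
                 result: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Update label/body/back in state, keeping old body/back when missing."""
    result = result or {}
    # Assumption: a missing label defaults to '<p>', as in the empty-result case.
    state["label"] = result.get("label", "<p>")
    state["body"] = result.get("body", state["body"])
    state["back"] = result.get("back", state["back"])
    return state
```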
What does this PR do?
Adds support for identifying and structuring special content within the document body in markup_doc. Includes:
Where could the review start?
By commits
How could this be tested manually?
Bring up the environment;
Load a DOCX with tables, images, or formulas;
Verify that these elements are detected and added in structured form;
Confirm that the existing document flow does not break with these new types.
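
Before running the steps above, it can help to confirm that a candidate DOCX actually contains the elements under test. A minimal sketch using only the standard library, assuming the usual OOXML layout (`word/document.xml`) and namespaces. `count_special_content` is a hypothetical helper and is not part of this PR.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict

NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "m": "http://schemas.openxmlformats.org/officeDocument/2006/math",
}


def count_special_content(docx_bytes: bytes) -> Dict[str, int]:
    """Count tables, embedded images, formulas and list paragraphs in a DOCX."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return {
        "tables": len(root.findall(".//w:tbl", NS)),
        "images": len(root.findall(".//a:blip", NS)),
        "formulas": len(root.findall(".//m:oMath", NS)),
        "list_paragraphs": len(root.findall(".//w:numPr", NS)),
    }
```

It could be pointed at the fixture added in this PR, for example by reading `fixtures/e14790.docx` in binary mode and passing the bytes to the function; a file whose counts are all zero would not exercise the new code paths.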
Any context you would like to provide?
It focuses solely on special content in the document body, leaving XML output, preview, and packaging for later PRs.
Screenshots
N/A
What are the relevant tickets?
#63
References